Introduction

Housing price dynamics are notoriously difficult to predict. More specifically, a complex interplay of factors determines the final market price of an individual property and shapes the housing market at large. In Chicago, the Institute for Housing Studies at DePaul University has identified three main challenges currently facing the city's housing market: loss of affordable rental supply, rising costs for new and existing homeowners, and increasing housing needs of older adults. As such, this report attempts to achieve two main objectives in order to support Chicago policy makers: determine the main factors contributing to changes in housing prices, and segment the Chicago housing market.

Data Sources Overview

We use two data sources. The metadata, included in the Appendix section, describes all relevant variables.

  1. Cook County Housing Price Index
  • Source: Institute for Housing Studies at DePaul University.

  • Description: This dataset contains information on housing prices across all community areas in the city of Chicago between Q1 1997 and Q2 2023, offering a comprehensive view of housing market trends. The dataset contains four sheets of information:

    • Summary Information: overview information on submarket housing prices, including change since 2000, median sales price for 2022 Q3 - 2023 Q2 (current), etc.
    • Quarterly Index Results: the relative housing price in every quarter, for all submarkets, compared to Q1 2000.
    • Submarket Definitions: information on submarket boundaries, i.e., which CAs are fully within, majority within, and partially within each submarket's boundary.
    • Submarket Annual Sample Size: the sample size the Institute used to estimate the information year over year for every submarket.
  • Variables: median housing prices per community area per quarter (from 2017 to 2022). These variables are estimated using other variables in the dataset.

  • Usage: The housing price data will serve as the dependent variable in our analysis, against which the impact of various independent variables will be measured.

  • Reference: Institute for Housing Studies - DePaul University. (n.d.). Cook County House Price Index. https://price-index.housingstudies.org/

  2. Community Data Snapshots (including CSV datasets and Chicago community areas’ PDF reports)
  • Source: Chicago Metropolitan Agency for Planning (CMAP).

  • Description: This dataset encompasses a range of economic and demographic variables from various Chicago community areas, offering a detailed view of local conditions and trends spanning six years (2017-2022).

  • Variables: there are six main categories of variables: Population and Households, Housing Characteristics, Transportation, Employment, Land Use, and Change Over Time. Each category contains several groups of variables, and each group has from one to a dozen variables. For example, the Population and Households category contains the General Population Characteristics group, which has a Total Population variable, a Total Households variable, etc. We recommend accessing a community area’s snapshot directly for a detailed description of each variable.

  • Usage: These variables will be used to understand the micro-level economic and demographic factors affecting housing prices in specific Chicago communities.

  • Reference: Chicago Metropolitan Agency for Planning. (n.d.). Community Data Snapshots raw data (2014 - 2022). https://datahub.cmap.illinois.gov/maps/2a0b0316dc2c4ecfa40a171c635503f8/about

Ethical Consideration & Stakeholders

We believe this report involves three main ethical considerations: Data Ownership and Privacy, Data Ownership and Usage, and Community and Individual Welfare. In addition, we identify four main groups of stakeholders: Chicago Community Area (CA) residents, Chicago policy makers and urban planners, the academic and research community, and the general public.

Data Import, Cleaning, and Manipulation

Our analysis is conducted in RStudio, leveraging these packages: tidyverse, caret, factoextra, leaps, glmnet, randomForest, partykit, openxlsx, janitor, tibble, magrittr, purrr, plotly, corrplot, rgl, and scales.

Import the data

Independent variables: We first import our independent variables from the CMAP Community Data Snapshots dataset, which contains a range of economic and demographic variables for the Chicago CAs between 2017 and 2022. We download these datasets to our local devices and then read them into R:

# Read the data (filter out columns with na values) and merge them together 
file1 <- read_csv("dataset/2017.csv") %>% select(where(~ all(!is.na(.))))
file2 <- read_csv("dataset/2018.csv") %>% select(where(~ all(!is.na(.))))
file3 <- read_csv("dataset/2019.csv") %>% select(where(~ all(!is.na(.))))
file4 <- read_csv("dataset/2020.csv") %>% select(where(~ all(!is.na(.))))
file5 <- read_csv("dataset/2021.csv") %>% select(where(~ all(!is.na(.))))
file6 <- read_csv("dataset/2022.csv") %>% select(where(~ all(!is.na(.))))

Dependent variables: We then import our dependent variable dataset from the Institute for Housing Studies at DePaul University, which contains housing prices across all sub-markets (each sub-market fully contains some CAs) in the city of Chicago between Q1 1997 and Q2 2023, offering a comprehensive view of housing market trends. We download the dataset to our local devices and then read it into R:

# access original data
path <- "Dataset/2023_q2_price_index_data.xlsx"
sheets <- openxlsx::getSheetNames(path) 
housing_dataset <- lapply(sheets, openxlsx::read.xlsx, xlsxFile=path) 

Since the dependent dataset comes in the form of an Excel file with multiple sheets, we create one dataset for each sheet.

# assigning names to original data sheets
names(housing_dataset) <- sheets

## Summary sheet
summary_info <- subset(housing_dataset$Summary, select = c(-1)) %>% 
  # set CA as row names
  column_to_rownames(var = names(.)[1]) %>% 
  # remove aggregate data for the cook county, the city, and the suburb
  slice(1:(n() - 3))

## Quarterly index result sheet
index <- housing_dataset$Quarterly_Index_Results %>% 
  # set column names
  row_to_names(row_number = 1) %>%
  # set row names
  remove_rownames() %>% 
  column_to_rownames(var = names(.)[1]) %>% 
  # convert to numeric data types
  mutate(across(where(is.character), ~ as.numeric(.))) %>% 
  # Convert from range 0 to 100 into 0 to 1. 
  mutate(across(everything(), ~ .x / 100)) %>% 
  # remove aggregate data for the cook county, the city, and the suburb
  subset(select = -c(1:3))

## Submarket_Definitions sheet
Submarket_Definition <- subset(housing_dataset$Submarket_Definitions, select = c(2:3)) %>% 
  # remove NA values 
  na.omit() %>% 
  # set CA as row names
  column_to_rownames(var = names(.)[1])  

## Submarket_Annual_Sample_Size sheet
Submarket_Annual_Sample_Size <- housing_dataset$Submarket_Annual_Sample_Size %>% 
  # remove aggregate data for the whole city
  select(-ncol(.)) %>% 
  slice(1:(n() - 1)) %>% 
  # use CA as column names
  row_to_names(row_number = 1) %>% 
  # set row names
  remove_rownames() %>% 
  {
    rownames(.) <- .[[1]]
    .[-1]
  }

Data Cleaning

Independent variables: We filter out variables that do not exist in all years and merge all independent dataframes. Thus we have:

# Identify common columns for merging
common_cols <- Reduce(intersect, list(names(file1), names(file2), names(file3), names(file4), names(file5), names(file6)))

# Subset Data Frames to Common Columns & Arrange GEOG as the First Column:
file1_common <- file1[, common_cols] %>% select(GEOG, everything())
file2_common <- file2[, common_cols] %>% select(GEOG, everything())
file3_common <- file3[, common_cols] %>% select(GEOG, everything())
file4_common <- file4[, common_cols] %>% select(GEOG, everything())
file5_common <- file5[, common_cols] %>% select(GEOG, everything())
file6_common <- file6[, common_cols] %>% select(GEOG, everything())

# Merge all datasets together
housing_data <- rbind(file1_common, file2_common, file3_common, 
                           file4_common, file5_common, file6_common)

housing_data$Year <- as.factor(housing_data$Year) # change years to factors

Dependent variables: Within the Cook County Housing Price Index dataset, we first use the summary information sheet to estimate the median housing price for each sub-market for every quarter from Q1 1997 to Q2 2023.

# Estimate the Q1 2000 price by multiplying the change since 2000 by the current median sales price.
summary_info$`2000_price` <- summary_info$Change.since.2000 * summary_info$`Median.Sales.Price.2022.Q3.-.2023.Q2`

# calculate the price data for each sub-market for each quarter years. 
submarket_price <- as.matrix(index) %*% diag(summary_info$`2000_price`) %>% 
  set_colnames(colnames(index)) %>% as.data.frame()

Given the price data for each sub-market from Q1 1997 to Q2 2023, and given that each sub-market fully contains some CAs, we determine the median housing price for each CA between Q1 1997 and Q2 2023. For example, if CA A lies fully within sub-market B, then the estimated housing price of B at every time stamp is assigned to A.

# create a dataframe that contains all the CAs within each sub-market, with each CA as a data point. 
CA_within <- Submarket_Definition %>%
  # create row name index
  rownames_to_column(var = "Sub-Market") %>%
  # separate out each CA. 
  separate_rows(`Municipalities.or.Chicago.Community.Areas.Entirely.Within*`, sep = ",") %>%
  # Group by each row name index
  group_by(`Sub-Market`) %>%
  # make each CA name its own data point. 
  mutate(row_id = row_number()) %>%
  pivot_wider(names_from = row_id, values_from = `Municipalities.or.Chicago.Community.Areas.Entirely.Within*`, names_prefix = "CA_within_") %>% 
  # transpose the dataset
  t() %>% 
  # make the colnames
  row_to_names(row_number = 1) %>% 
  # change to dataframe type
  as.data.frame()

# create a CA price dataset
# Create an empty list to store data points
CA_price <- list()
# Loop through each sub-market price data
for (market in colnames(submarket_price)){
  # loop through each CA name
  for (CA in CA_within[[market]]){
    # skip NA value
    if (is.na(CA)){next}
    # assign the sub-market price vector to each CA that lies entirely within it 
    CA_price[[`CA`]] <- submarket_price[[market]]
  }
}

# Update the CA price dataframe column names
CA_price <- data.frame(CA_price)
colnames(CA_price) <- gsub("X\\.", "", colnames(CA_price))
colnames(CA_price) <- gsub("\\.", " ", colnames(CA_price))

# Update the CA price dataframe row names
rownames(CA_price) <- rownames(submarket_price)
# Convert rownames to a column named 'YearQuarter'
CA_price <- CA_price %>% 
  rownames_to_column(var = "YearQuarter")

In our later analysis, our time unit will be the year, from 2017 to 2022. Hence, for the CA_price dataframe, which has the median house price for each quarter, we take the average over all quarters of each year as the annual data point.

# Convert quarterly price data into annual price data by averaging the quarterly data points. 
price_data <- CA_price %>%
  select(-starts_with("Unnamed")) %>%
  gather(key = "Location", value = "Price", -YearQuarter) %>%
  separate(YearQuarter, into = c("Year", "Quarter"), sep = 4) %>%
  mutate(Year = as.numeric(Year)) %>%
  filter(Year %in% 2017:2022) %>%
  group_by(Year, Location) %>%
  summarize(AveragePrice = mean(Price, na.rm = TRUE), .groups = 'drop')

Data Merging and Manipulation

  1. Data Merging: We merge the independent and dependent variable datasets using the names of CAs that exist in both datasets as the key variable. It should be noted that all CAs have unique names.
# Only keep GEOG values that are present in both datasets
matching_geog <- intersect(housing_data$GEOG, price_data$Location)

# Filter the price data for matching GEOG values
price_data_filtered <- filter(price_data, Location %in% matching_geog)
price_data_filtered$Year <- as.factor(price_data_filtered$Year)

# Merge the datasets and filter out unnecessary variables 
clean_data <- housing_data %>% filter(GEOG %in% matching_geog) %>%
  left_join(price_data_filtered, by = c("Year", "GEOG" = "Location"))

Our final clean_data dataset comprehensively combines the dependent variable, the median house price, with the independent variables, the various CA features.

  2. Data Manipulation

We remove ‘isolated’ variables, i.e., variables that belong to a group with missing members. For example, the Age Cohorts group contains three variables in our final dataset: A20_34, A35_49, and A50_64; the group is thus missing variables for residents younger than 20 or older than 64. Hence, we remove these variables. In addition, we remove variables for which we could not infer a definition from the original dataset.

# remove isolated variables
clean_data <- clean_data %>% subset(select = -c(A20_34, A35_49, A50_64, POP_16OV, POP_25OV, HS, BACH, HU_SNG_DET, HU_SNG_ATT,HU_2UN, HU_3_4UN, HV_LT_150K, HV_150_300K, HV_300_500K, HV_GT_500K, MED_HV, TOT_EMP_RES))

From the employment category, we are interested only in the average aggregate of the employment figures rather than in specific employment figures for each top industry or location, due to ethical considerations. As such, we create RES_NAICS_AVG, the average employment of the top five most-employed sectors among CA residents, and WORK_NAICS_AVG, the average employment of the top five most-employed sectors within the CA boundary. We then proceed to remove the component variables.

# Create New variables for average employment of top 5 most employed sectors of CA residents and average employment of top 5 most employed sectors in CA boundary. 

# average employment of top 5 most employed sectors of CA residents
clean_data <- clean_data %>% 
  rowwise() %>% 
  mutate(RES_NAICS_AVG = sum(RES_NAICS1_COUNT, RES_NAICS2_COUNT, RES_NAICS3_COUNT, RES_NAICS4_COUNT, RES_NAICS5_COUNT, na.rm = T)/5)

# we are interested only in the average employment of the top 5 most employed sectors of residents. Thus we remove all its components. 
clean_data <- clean_data %>% 
  subset(select = -c(RES_NAICS1_COUNT, RES_NAICS2_COUNT, RES_NAICS3_COUNT, RES_NAICS4_COUNT, RES_NAICS5_COUNT))

# average employment of top 5 most employed sectors in CA boundary. 
clean_data <- clean_data %>% 
  rowwise() %>% 
  mutate(WORK_NAICS_AVG = sum(WORK_NAICS1_COUNT, WORK_NAICS2_COUNT, WORK_NAICS3_COUNT, WORK_NAICS4_COUNT, WORK_NAICS5_COUNT, na.rm = T)/5)

# we are interested in the average value of the employment of top 5 most employed sectors in CA boundary only. Thus we remove all its components. 
clean_data <- clean_data %>% 
  subset(select = -c(WORK_NAICS1_COUNT, WORK_NAICS2_COUNT, WORK_NAICS3_COUNT, WORK_NAICS4_COUNT, WORK_NAICS5_COUNT))

# Remove resident_city_count & work_city_count variables
clean_data <- clean_data %>% 
  subset(select = -c(RES_CITY1_COUNT, RES_CITY2_COUNT, RES_CITY3_COUNT, RES_CITY4_COUNT, RES_CITY5_COUNT, WORK_CITY1_COUNT, WORK_CITY2_COUNT, WORK_CITY3_COUNT, WORK_CITY4_COUNT, WORK_CITY5_COUNT))

Since our analysis focuses only on numerical variables, we remove all categorical variables except the geo-code location and the year.

# remove categorical variables (all variables that are non-numeric or non-integer) except for the geo-code location and the year. 
clean_data <- clean_data[sapply(clean_data, function(x) is.numeric(x) || is.integer(x)) | colnames(clean_data) %in% c("GEOG", "Year")]

numerical_data <- clean_data %>%  # Take only numerical value
  select(-c(GEOG,Year))

head(clean_data)
## # A tibble: 6 × 64
## # Rowwise: 
##   GEOG       TOT_POP MED_AGE WHITE  HISP BLACK ASIAN OTHER POP_HH IN_LBFRC   EMP
##   <chr>        <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl> <dbl>
## 1 Albany Pa…   52079    33.8 15349 25003  2215  8223  1289  50477    29624 26868
## 2 Archer He…   13266    32.0  2572 10109   114   388    82  13256     6594  5808
## 3 Armour Sq…   14068    42.0  1545   438  1348 10324   413  14058     6997  5919
## 4 Ashburn      42752    36.2  5555 15455 20817   284   641  42132    21985 19004
## 5 Auburn Gr…   45842    39.1   231   689 44402   117   403  45727    20187 14910
## 6 Austin       97643    34.7  4353 11132 81066   467   625  96495    41638 33519
## # ℹ 53 more variables: UNEMP <dbl>, NOT_IN_LBFRC <dbl>, WORK_AT_HOME <dbl>,
## #   TOT_COMM <dbl>, DROVE_AL <dbl>, CARPOOL <dbl>, TRANSIT <dbl>,
## #   WALK_BIKE <dbl>, COMM_OTHER <dbl>, INC_LT_25K <dbl>, INC_25_50K <dbl>,
## #   INC_50_75K <dbl>, INC_75_100K <dbl>, INC_100_150K <dbl>, INC_GT_150 <dbl>,
## #   MEDINC <dbl>, TOT_HH <dbl>, OWN_OCC_HU <dbl>, RENT_OCC_HU <dbl>,
## #   VAC_HU <dbl>, HU_TOT <dbl>, MED_ROOMS <dbl>, HA_AFT2000 <dbl>,
## #   HA_70_00 <dbl>, HA_40_70 <dbl>, HA_BEF1940 <dbl>, BR_0_1 <dbl>, …

Methodology

Exploratory Data Analysis

Housing Price Distribution

We first examine the distribution of housing prices in the Chicago market. More specifically, we visualize the distribution of housing prices in Chicago between 2017 and 2022.

# Define a list of years
unique_years <- unique(clean_data$Year)

# Define a color palette
colors <- RColorBrewer::brewer.pal(length(unique_years), "Set2")

# Initialize the plot
prices <- plot_ly()

# Determine an appropriate bin width
# This is something you might want to tweak to get the look you want
bin_width <- 200000  

# Add a trace for each year when changing the slider
for (i in seq_along(unique_years)) {
    year <- unique_years[i]
    prices <- prices %>% add_trace(
        data = subset(clean_data, Year == year),
        x = ~AveragePrice,
        type = 'histogram',
        visible = i == 1,  # Only the first trace is visible initially
        name = as.character(year),
        marker = list(color = colors[i]),
        xbins = list(start = min(clean_data$AveragePrice), end = max(clean_data$AveragePrice), size = bin_width)  # Setting bin width here
    )
}

# Define the steps for the slider
steps <- lapply(unique_years, function(year) {
    list(
        method = "restyle",
        args = list("visible", lapply(unique_years, function(y) y == year)),
        label = as.character(year)
    )
})

# Add the slider and update layout
prices <- prices %>% layout(
    sliders = list(
        list(
            active = 0,
            currentvalue = list(prefix = "Year: "),
            steps = steps
        )
    ),
    title = "Distribution of Housing Prices Over Years",
    xaxis = list(
        title = "Average Price",
        # Set the range if you want to focus on a specific portion of the data
        range = c(min(clean_data$AveragePrice), max(clean_data$AveragePrice)) 
    ),
    yaxis = list(title = "Frequency"),
    hovermode = 'closest',
    font = list(size = 12),
    bargap = 0.1  # Adjust the gap between bars if needed
)

# Show the plot
prices

From the graph we see that the distribution of housing prices in Chicago is right-skewed in every year of the study period. In addition, the median and average housing prices are $710,849 and $1,027,310 in 2017, and $1,098,684 and $1,520,350 in 2022, respectively. As such, we infer that housing prices in Chicago grew over the five-year period. The increase is not uniform throughout the city: the median price increased by about 55%, while the average price increased by about 48%. Since the median grew faster than the average, we infer that housing units priced below the median saw a larger relative increase than higher-priced units.
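The yearly summary statistics quoted above can be computed along these lines; the snippet below is a minimal sketch run on a toy data frame (toy_data and its values are invented for illustration, not the real clean_data):

```r
# Minimal sketch of the yearly median/mean computation; toy_data stands in
# for the real clean_data, and its prices are invented for illustration.
library(dplyr)

toy_data <- data.frame(
  Year = factor(c(2017, 2017, 2022, 2022, 2022)),
  AveragePrice = c(600000, 800000, 900000, 1000000, 1700000)
)

price_summary <- toy_data %>%
  group_by(Year) %>%
  summarize(MedianPrice = median(AveragePrice, na.rm = TRUE),
            MeanPrice   = mean(AveragePrice, na.rm = TRUE),
            .groups = "drop")
price_summary
```

On the real data, replacing toy_data with clean_data yields the per-year medians and means reported above.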

Correlation Matrix

In determining the impact of different CA features on housing prices, we cannot ignore the impact of these features on each other. We therefore build a correlation matrix including all variables that have an absolute correlation greater than 0.75 with at least one other variable:

# Calculate the correlation matrix
corr_matrix <- cor(numerical_data)

# Set a high threshold for correlation
threshold <- 0.75
corr_matrix[abs(corr_matrix) < threshold] <- NA  # Set correlations below threshold to NA

# Function to check if all non-NA values are 1
all_one_or_na <- function(x) all(x[!is.na(x)] == 1)

# Remove columns that contain only 1 and NA
corr_matrix <- corr_matrix[, !apply(corr_matrix, 2, all_one_or_na)]

# Remove rows that contain only 1 and NA
corr_matrix <- corr_matrix[!apply(corr_matrix, 1, all_one_or_na), ]

# Identify variables with at least one significant correlation
significant_vars <- apply(!is.na(corr_matrix), 1, any)

# Filter out variables without significant correlations
corr_matrix_filtered <- corr_matrix[significant_vars, significant_vars]
diag(corr_matrix_filtered) <- NA  # Exclude self-correlations

# Create an interactive heatmap with the filtered matrix
fig_corr <- plot_ly(x = colnames(corr_matrix_filtered), y = colnames(corr_matrix_filtered), z = corr_matrix_filtered, 
                    type = "heatmap", colorscale = "Plasma", 
                    zmin = threshold, zmax = 1) %>%
            layout(title = "Filtered Correlation Matrix (Absolute Corr >= 0.75)",
                   xaxis = list(title = "Variables", tickangle = 45, tickfont = list(size = 10)),
                   yaxis = list(title = "Variables", tickangle = -45, tickfont = list(size = 10)),
                   margin = list(l = 100, r = 100, t = 100, b = 100))


fig_corr

Our independent dataset has 62 variables, of which 48 have at least one strong absolute correlation with another variable. Since the majority of variables (around 77.42%) are strongly correlated with at least one other variable, CA features appear to strongly influence each other. Moreover, given the number of strong correlations between variables, we infer that there is evidence of multicollinearity among the independent variables.

We therefore perform variable selection to reduce multicollinearity. More specifically, our method performs hierarchical clustering to determine clusters of highly correlated variables and removes all variables within each cluster except one. This ensures that variables that are highly correlated but do not belong to the same cluster are not eliminated, decreasing model multicollinearity without increasing bias.

# Calculate the initial correlation matrix
corr_matrix <- cor(numerical_data)

# Perform hierarchical clustering
abs_corr_matrix <- abs(corr_matrix)
diag(abs_corr_matrix) <- 0  # Exclude self-correlations
hc <- hclust(as.dist(1 - abs_corr_matrix), method = "average")

# Identify clusters of highly correlated variables (threshold: 0.9)
high_corr_threshold <- 0.9
clusters <- cutree(hc, h = 1 - high_corr_threshold)

# Select one variable from each cluster
vars_to_keep <- tapply(names(clusters), clusters, function(x) x[1])

# Update  numerical dataset by keeping only the selected variables
filtered_numerical_data <- numerical_data[, vars_to_keep, drop = FALSE]

Hence, our new filtered correlation matrix is:

# Recalculate the correlation matrix with the filtered data
filtered_corr_matrix <- cor(filtered_numerical_data)

# Set a threshold for displaying significant correlations
threshold <- 0.75
filtered_corr_matrix[abs(filtered_corr_matrix) < threshold] <- NA
diag(filtered_corr_matrix) <- NA  # Exclude self-correlations

# Remove columns that contain only 1 and NA
filtered_corr_matrix <- filtered_corr_matrix[, !apply(filtered_corr_matrix, 2, all_one_or_na)]

# Remove rows that contain only 1 and NA
filtered_corr_matrix <- filtered_corr_matrix[!apply(filtered_corr_matrix, 1, all_one_or_na), ]

# Create an interactive heatmap with the filtered correlation matrix
fig_corr <- plot_ly(x = colnames(filtered_corr_matrix), y = colnames(filtered_corr_matrix), z = filtered_corr_matrix, 
                    type = "heatmap", colorscale = "Plasma", 
                    zmin = threshold, zmax = 1) %>%
            layout(title = "Filtered Correlation Matrix (Absolute Corr >= 0.75)",
                   xaxis = list(title = "Variables", tickangle = 45, tickfont = list(size = 10)),
                   yaxis = list(title = "Variables", tickangle = -45, tickfont = list(size = 10)),
                   margin = list(l = 100, r = 100, t = 100, b = 100))

fig_corr

After variable selection based on the initial correlation matrix, our filtered_numerical_data, which contains all numerical independent variables remaining after removing highly correlated ones, has 44 variables. Of these, 28 have at least one strong absolute correlation with another variable. Although the majority of variables (around 63.64%) are still strongly correlated with at least one other variable, we believe these correlations do not indicate multicollinearity, because the variables that remain highly correlated were not in the same cluster during the hierarchical clustering variable selection process.

We update the clean_data dataframe with the filtered numerical variables in the filtered_numerical_data dataframe.

# Remove numerical_data subset out of clean_data 
columns_to_keep <- setdiff(names(clean_data), names(numerical_data)) # Identify columns in clean_data that are not in numerical_data
clean_data <- clean_data[columns_to_keep] # Subset clean_data to keep only the columns that are not in numerical_data

# Add filtered_numerical_data into clean_data
clean_data <- cbind(clean_data, filtered_numerical_data)

Analysis & Findings

Model Overviews

We proceed with the following statistical analyses and model building:

  1. Lasso Regression: Since our main dataset, clean_data, still contains 44 numerical variables, we perform Lasso Regression. This method performs variable selection and regularization by imposing a penalty on the absolute size of the coefficients, which helps continue identifying the most significant predictors in our dataset.

  2. Principal Component Analysis (PCA) & K-means Clustering: In order to segment the housing market in Chicago, we perform PCA and then apply K-means Clustering to the PCA results. More specifically, we perform PCA to reduce the dimensionality of the data while retaining the majority of its information. The dimension reduction is achieved by identifying the principal components (PCs), which are uncorrelated linear combinations of the original variables that capture maximum variance. After determining the PCs, we apply K-means Clustering to segment the dataset into distinct groups. K-means thus allows us to determine different segments of the Chicago Community Areas housing market and reveals intrinsic groupings based on the underlying structure.

  3. Decision Tree Regression: In order to determine the main factors that change Chicago housing prices, we apply Decision Tree Regression. Decision trees can be visualized and are relatively easy to understand, because the method provides clear decision rules for how the inputs impact the target variable. In addition, the method can handle quantitative predictors, qualitative predictors, and missing data.
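As an illustration of the decision-tree approach (not the report's actual model), the sketch below fits a conditional inference regression tree with partykit::ctree, one tree implementation from the partykit package loaded earlier; the synthetic data and variable names (income, population, price) are invented for illustration.

```r
# Illustrative only: a regression tree on synthetic data. partykit::ctree()
# fits a conditional inference tree; toy variable names are invented.
library(partykit)

set.seed(1)
toy <- data.frame(
  income     = runif(200, 20000, 120000),
  population = runif(200, 1000, 100000)
)
toy$price <- 2 * toy$income + 0.5 * toy$population + rnorm(200, sd = 5000)

tree_model <- ctree(price ~ income + population, data = toy)
# plot(tree_model) visualizes the split rules; predict() returns fitted prices
preds <- predict(tree_model, newdata = toy)
```

Each internal node of the fitted tree gives a threshold rule on one predictor, which is what makes the method's output directly readable as "factors that change price".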

Lasso Regression

We perform Lasso Regression to continue variable selection and increase accuracy.

# Data Preparation for Lasso Regression
# 'AveragePrice' is the dependent variable
target <- filtered_numerical_data$AveragePrice
features <- filtered_numerical_data %>% select_if(is.numeric) %>% select(-AveragePrice)
# Data preparation into x and y
x <- as.matrix(features)
y <- target
set.seed(123)
# Fit the Lasso regression coefficient path
lasso_model <- glmnet(x, y, alpha=1)
plot(lasso_model)

The resulting Lasso coefficient path plot provides a visual representation of the model’s coefficients as a function of the L1 norm, reflecting the cumulative magnitude of the coefficients under the Lasso penalty, λ. As λ decreases (and the L1 norm increases), coefficients either stay at 0 or move from 0 to non-zero values. We continue by identifying the optimal λ value that minimizes the Mean Squared Error (MSE).

# To find the best lambda
cv_model <- cv.glmnet(x, y, alpha=1)
best_lambda <- cv_model$lambda.min
min_mse <- min(cv_model$cvm)  # minimum cross-validated MSE

# Fit the LASSO model using the best lambda value
lasso_best <- glmnet(x, y, alpha = 1, lambda = best_lambda)

print(cv_model)
## 
## Call:  cv.glmnet(x = x, y = y, alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index   Measure        SE Nonzero
## min     91    99 2.983e+11 3.000e+10      43
## 1se   7209    52 3.256e+11 3.909e+10      34
print(lasso_best)
## 
## Call:  glmnet(x = x, y = y, alpha = 1, lambda = best_lambda) 
## 
##   Df  %Dev Lambda
## 1 43 83.44  90.97

Our results indicate an optimal λ of about 91 (90.97), with cross-validation showing a minimum MSE of about 2.98 × 10^11. Overall, the model explains 83.44% of the deviance in the dependent variable. In addition, since Lasso Regression performs variable selection by pushing some coefficients to 0, we list some of the non-zero coefficients:

# Extract the coefficients
coefficients <- coef(lasso_best, s = best_lambda)

# Coefficients are returned in a sparse matrix; convert to regular matrix for easier handling
coefficients <- as.matrix(coefficients)

# Identify non-zero coefficients (i.e., variables retained in the model)
non_zero_coefficients <- coefficients[coefficients != 0, , drop = FALSE]

# Convert into matrix form for significant variables 
non_zero_coefficients <- as.matrix(non_zero_coefficients)

# Extract the names of the variables with non-zero coefficients (excluding intercept)
selected_features <- setdiff(rownames(non_zero_coefficients), "(Intercept)")

# Subset the original dataset to include only final features for PCA
final_filtered_data <- numerical_data[selected_features]

# show some non-zero coefficients
print(non_zero_coefficients, max = 6)
##                                s1
## (Intercept)          5.971136e+06
## TOT_POP             -1.763783e+02
## MED_AGE             -2.875071e+04
## WHITE               -2.480088e+01
## HISP                 8.058051e+00
## BLACK               -1.338189e-01
##  [ reached getOption("max.print") -- omitted 38 rows ]

From the Lasso Regression results, we identify 44 non-zero coefficients, i.e., the intercept plus all 43 predictors. Thus, Lasso Regression does not eliminate any further numerical variables. We nonetheless update our main dataset, clean_data.

# Remove numerical_data subset out of clean_data 
columns_to_keep <- setdiff(names(clean_data), names(final_filtered_data)) # Identify columns in clean_data that are not in final_filtered_data
clean_data <- clean_data[columns_to_keep] # Subset clean_data to keep only the columns that are not in final_filtered_data

# Add final_filtered_data into clean_data
clean_data <- cbind(clean_data, final_filtered_data)

Principal Component Analysis & K-Means Clustering

In order to determine the different housing market segments in Chicago, we implement PCA and K-Means Clustering. We first apply PCA to our data.

final_filtered_data$AveragePrice <- clean_data$AveragePrice
# Perform PCA on the dataset with selected features
pca_result <- prcomp(final_filtered_data, scale. = TRUE, center=TRUE)

# Explained Variance
summary(pca_result)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     3.9572 2.4972 2.1831 1.71909 1.63380 1.47221 1.35761
## Proportion of Variance 0.3559 0.1417 0.1083 0.06717 0.06067 0.04926 0.04189
## Cumulative Proportion  0.3559 0.4976 0.6059 0.67310 0.73377 0.78303 0.82492
##                            PC8     PC9   PC10   PC11    PC12    PC13    PC14
## Standard deviation     1.10429 1.00627 0.8520 0.8125 0.76706 0.73417 0.64069
## Proportion of Variance 0.02772 0.02301 0.0165 0.0150 0.01337 0.01225 0.00933
## Cumulative Proportion  0.85263 0.87565 0.8921 0.9072 0.92052 0.93277 0.94210
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.61365 0.51218 0.45090 0.44755 0.42444 0.40466 0.39469
## Proportion of Variance 0.00856 0.00596 0.00462 0.00455 0.00409 0.00372 0.00354
## Cumulative Proportion  0.95066 0.95662 0.96124 0.96579 0.96989 0.97361 0.97715
##                          PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.3571 0.33735 0.32665 0.29283 0.27840 0.26947 0.24765
## Proportion of Variance 0.0029 0.00259 0.00242 0.00195 0.00176 0.00165 0.00139
## Cumulative Proportion  0.9800 0.98263 0.98506 0.98701 0.98877 0.99042 0.99181
##                           PC29   PC30    PC31    PC32    PC33   PC34    PC35
## Standard deviation     0.24359 0.2295 0.20856 0.19623 0.18043 0.1629 0.15127
## Proportion of Variance 0.00135 0.0012 0.00099 0.00088 0.00074 0.0006 0.00052
## Cumulative Proportion  0.99316 0.9944 0.99535 0.99622 0.99696 0.9976 0.99809
##                           PC36    PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     0.14536 0.12757 0.12282 0.11391 0.10090 0.07629 0.04535
## Proportion of Variance 0.00048 0.00037 0.00034 0.00029 0.00023 0.00013 0.00005
## Cumulative Proportion  0.99857 0.99894 0.99928 0.99957 0.99981 0.99994 0.99998
##                           PC43      PC44
## Standard deviation     0.02636 6.831e-06
## Proportion of Variance 0.00002 0.000e+00
## Cumulative Proportion  1.00000 1.000e+00
# Scree plot to visualize variance explained by each principal component
fviz_eig(pca_result)

From the PCA results, we see that roughly 89% of the variance is explained by the first 10 PCs. As such, we believe implementing K-Means clustering on these 10 PCs would maintain accuracy while increasing the interpretability of the clusters. Thus, we implement K-Means clustering with only these 10 PCs.
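The cut-off can also be computed directly from the PCA object instead of being read off the summary — a small self-contained sketch on synthetic data, using an illustrative 90% cumulative-variance threshold:

```r
# Sketch: pick the smallest number of PCs whose cumulative proportion of
# variance reaches a chosen threshold (here 90%), on synthetic data.
set.seed(123)
toy <- matrix(rnorm(100 * 8), 100, 8)
toy[, 2] <- toy[, 1] + rnorm(100, sd = 0.1)  # correlated columns concentrate variance

pca <- prcomp(toy, scale. = TRUE, center = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)  # cumulative proportion of variance
n_pc <- which(cum_var >= 0.90)[1]                # first PC index crossing the threshold
```

Applied to `pca_result`, the same two lines would recover the number of components retained here.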

K-Means Clustering

We implement K-Means clustering for 2 purposes: to determine the optimal number of segments for the housing market and to obtain the characteristics of each housing market segment.

In order to determine the optimal number of clusters, we utilize the Elbow method because its methodology evaluates the within-cluster sum of squares and the separation distance between clusters. We should note that identifying the appropriate number of clusters is crucial, as it influences the accuracy of the market segmentation and the interpretability of the results.

# Extract the scores for the first 10 principal components according to the Scree plot
pca_scores <- pca_result$x[, 1:10]

# Determine the optimal number of clusters (using the elbow method)
set.seed(123)
wss <- sapply(1:15, function(k){kmeans(pca_scores, k, nstart = 10)$tot.withinss})
plot(1:15, wss, type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K", ylab="Total within-clusters sum of squares")

# Perform K-means clustering with the number of clusters chosen from the elbow method (k = 3)
kmeans_result <- kmeans(pca_scores, centers = 3, nstart = 25)

From the plot above, the optimal number of clusters lies between 3 and 6, as the total within-cluster sum of squares flattens out beyond that range. Thus, we proceed with 3 as the optimal number of clusters. We choose the smallest plausible value to minimize the model's complexity and to increase the chance that the clusters correspond to low, middle, and high tiers, improving the economic interpretability of the housing market. As such, we proceed to assign each observation's cluster classification into the final_filtered_data dataset.
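As a hedged cross-check on the elbow method (not part of the original analysis), the average silhouette width offers a second criterion for k. A self-contained sketch on synthetic data with three well-separated groups:

```r
# Sketch: average silhouette width for k = 2..6 on synthetic 2-D data.
# Larger average silhouette width indicates better-separated clusters.
library(cluster)  # ships with base R as a recommended package
set.seed(123)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 4), ncol = 2),
             matrix(rnorm(60, mean = 8), ncol = 2))

avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 10)
  mean(silhouette(km$cluster, dist(toy))[, "sil_width"])
})
best_k <- (2:6)[which.max(avg_sil)]  # recovers the three true groups here
```

Running the same loop over `pca_scores` would indicate whether k = 3 also maximizes silhouette width for the housing data.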

# Assign the clusters back into the full original dataset (with year)
final_filtered_data$Clusters <- kmeans_result$cluster # assign the clusters into the full filtered dataset with year and location
final_filtered_data$Clusters <- as.factor(final_filtered_data$Clusters)
final_filtered_data$GEOG <- clean_data$GEOG
final_filtered_data$Year <-clean_data$Year
# Summary of avg housing price for each cluster
summary_table <- final_filtered_data %>% group_by(Clusters) %>% 
  summarise(
    Count = n(),
    Mean = mean(AveragePrice, na.rm = TRUE),
    SD = sd(AveragePrice, na.rm = TRUE),
    Min = min(AveragePrice, na.rm = TRUE),
    Median = median(AveragePrice, na.rm = TRUE),
    Max = max(AveragePrice, na.rm = TRUE)
  ) 


print(summary_table)
## # A tibble: 3 × 7
##   Clusters Count     Mean       SD      Min   Median      Max
##   <fct>    <int>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 1          163  859093.  642285.  166651.  754552. 3611389.
## 2 2           55 2997324. 1376572. 1573985. 2249095. 5809600.
## 3 3          208 1127029. 1121686.  166651.  753148. 5809600.

From the table results, we can reasonably divide the housing market in Chicago into 3 categories: low, middle, and high segments. This is due to the drastic differences in average housing price between segments, with the low, high, and middle segments having average housing prices of $859,093, $2,997,324, and $1,127,029 respectively. In particular, we probed the differences between market segments by diving into each segment's total-population characteristics. We believe that higher-population areas tend to have higher demand for housing and thus higher housing prices. As such, we decide to highlight only the CAs above the 80th percentile and below the 20th percentile in terms of total population.

# Define a color palette for the clusters
num_clusters <- length(unique(final_filtered_data$Clusters))
cluster_colors <- RColorBrewer::brewer.pal(num_clusters, "Set1")

# Get the unique years in the dataset
unique_years <- sort(unique(final_filtered_data$Year))

# Initialize the plot
scatter_plot <- plot_ly()
# Determine the population quantiles for labeling (top and bottom 20% of CAs)
top_bottom_quantile <- quantile(final_filtered_data$TOT_POP, probs = c(0.2, 0.8))
final_filtered_data$Label <- ifelse(final_filtered_data$TOT_POP > top_bottom_quantile[2] | 
                             final_filtered_data$TOT_POP < top_bottom_quantile[1], 
                             as.character(final_filtered_data$GEOG), 
                             NA)

# Add a trace for each cluster (not each year and cluster combination)
for (j in seq_along(unique(final_filtered_data$Clusters))) {
  # Initialize each trace with the first year's data for that cluster;
  # the slider steps below restyle x, y, and text for the other years
  cluster_data <- filter(final_filtered_data, Year == unique_years[1], Clusters == as.factor(j))

  scatter_plot <- scatter_plot %>% add_trace(
    data = cluster_data,
    x = ~TOT_POP,
    y = ~AveragePrice,
    type = 'scatter',
    mode = 'markers+text',
    name = paste("Cluster", j),
    marker = list(color = cluster_colors[j]),
    text = ~Label,
    textposition = 'top center',
    hoverinfo = 'text+x+y',
    showlegend = TRUE
  )
}

# Create the steps for the slider
steps <- vector("list", length(unique_years))
for (i in seq_along(unique_years)) {
  steps[[i]] <- list(
    method = "restyle",
    args = list(
      list(
        "x" = lapply(seq_along(unique(final_filtered_data$Clusters)), 
                     function(j) filter(final_filtered_data, Year == unique_years[i], Clusters == as.factor(j))$TOT_POP),
        "y" = lapply(seq_along(unique(final_filtered_data$Clusters)), 
                     function(j) filter(final_filtered_data, Year == unique_years[i], Clusters == as.factor(j))$AveragePrice),
        "text" = lapply(seq_along(unique(final_filtered_data$Clusters)),
                        function(j) {
                          year_cluster_data <- filter(final_filtered_data, Year == unique_years[i], Clusters == as.factor(j))
                          top_bottom_quantile <- quantile(year_cluster_data$TOT_POP, probs = c(0.2, 0.8))
                          ifelse(year_cluster_data$TOT_POP > top_bottom_quantile[2] | 
                                 year_cluster_data$TOT_POP < top_bottom_quantile[1], 
                                 as.character(year_cluster_data$GEOG), 
                                 NA)
                        })
      )
    ),
    label = as.character(unique_years[i])
  )
}

# Add the slider and update the layout
scatter_plot <- scatter_plot %>% layout(
  sliders = list(
    list(
      active = 0,
      currentvalue = list(prefix = "Year: "),
      steps = steps
    )
  ),
  title = "Clustering of Average Housing Prices by Population Over Years",
  xaxis = list(title = "Total Population"),
  yaxis = list(title = "Average Housing Price"),
  hovermode = 'closest'
)

# Show the plot
scatter_plot

The plot illustrates that from 2017 to 2022, Cluster 2, which represents the highest average housing price segment, also encompasses areas with larger populations, with West Town and Rogers Park being prominent examples. Conversely, Cluster 3, characterized by moderate housing prices, consistently corresponds to regions of lower population density, notably including Washington Heights and West Pullman. Meanwhile, Cluster 1, which is associated with the lowest average housing prices, exhibits a broad spectrum of population totals, indicating a diverse range of population densities within this cluster, with locations such as South Lawndale and Portage Park.

Random Forest Regression

Finally, we utilize random forest regression to determine the most significant variables impacting housing price.

control <- trainControl(method = "cv", number = 10) # train using 10-fold cross validation
# Define the tuning grid
tune_grid <- expand.grid(mtry = seq(2, length(selected_features), by = 5))
# Subset the data
forest_data <- final_filtered_data %>% 
  select(-c(Label, Year, GEOG))
set.seed(123)  # for reproducibility

# Train the model
rf_model_tuned <- train(AveragePrice ~ ., 
                        data = forest_data, 
                        method = "rf", 
                        trControl = control,
                        tuneGrid = tune_grid)


# Print the model summary
print(rf_model_tuned)
## Random Forest 
## 
## 426 samples
##  44 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 383, 382, 383, 384, 384, 384, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    2    252009.8  0.9700304  156531.5
##    7    223046.8  0.9739722  138351.3
##   12    209009.8  0.9765224  130572.0
##   17    200867.1  0.9775351  128056.5
##   22    203387.6  0.9764808  128640.2
##   27    193892.0  0.9780274  124636.8
##   32    193069.9  0.9779350  122834.4
##   37    189733.5  0.9781133  121689.9
##   42    183967.8  0.9794397  120063.8
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 42.
# Plot the model performance
plot(rf_model_tuned)

We implement cross-validation to determine the optimal number of randomly selected predictors (mtry) that minimizes RMSE. Our results show that at 42 randomly selected predictors, the RMSE reaches its minimum of approximately $183,968. In addition, the random forest model shows a very high R² at 42 randomly selected predictors of 97.94%; that is, the model explains 97.94% of the variability in the response variable. However, since the optimal model draws on 42 of the 44 available predictors, we infer a risk of over-fitting. Therefore, we further investigate the model's performance by looking at out-of-sample error rates. In addition, we determine the top 10 most significant variables from the model:

set.seed(123)  # For reproducibility
optimal_rf_model <- randomForest(AveragePrice ~ ., data = clean_data, mtry = 42, ntree = 200) # fit the full model with optimal mtry 

plot(optimal_rf_model) # plot the full model performance

# Get the important values in the RF model (node purity)
importance_values <- importance(optimal_rf_model)


# Create a data frame for plotting
importance_df <- data.frame(Variable = rownames(importance_values), 
                            Importance = importance_values[, "IncNodePurity"])

# Ordering variables by importance
importance_df <- importance_df[order(-importance_df$Importance),]

# Filter the top 10 most significant variables
top_10 <- importance_df %>%
  top_n(10, Importance) %>%
  arrange(desc(Importance))

# Create the ggplot plot
ggplot(top_10, aes(x = reorder(Variable, Importance), y = Importance, fill = Variable)) +
  geom_bar(stat = "identity", color = "black", size = 0.5) +
  coord_flip() +  # Flip coordinates for horizontal layout
  theme_minimal() +
  labs(title = "Top 10 Variable Importance in Random Forest Model", 
       x = "Variable", 
       y = "Increase in Node Purity") +
  theme(axis.text.y = element_text(size = 12),  # Adjust text size on the y-axis
        axis.title.x = element_blank(),  # Remove x-axis label
        legend.position = "none")   # Remove the legend

Based on the error plot, the model's error quickly levels off as the number of trees increases. The results suggest that approximately 50 trees are adequate to achieve a near-optimal error rate, so the 200-tree fit with 42 randomly selected predictors is more than sufficient.
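The over-fitting concern raised above can also be probed with a simple hold-out split — a hedged sketch on synthetic data (the report itself relies on the out-of-bag error plot), comparing in-sample and out-of-sample RMSE:

```r
# Sketch: hold-out check for a random forest regression on toy data.
# A large gap between train and test RMSE would signal over-fitting.
library(randomForest)
set.seed(123)
n  <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 2 * df$x1 - df$x2 + rnorm(n)

train_idx <- sample(n, 0.8 * n)              # 80/20 split
rf <- randomForest(y ~ ., data = df[train_idx, ], ntree = 200)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse_train <- rmse(df$y[train_idx],  predict(rf, df[train_idx, ]))
rmse_test  <- rmse(df$y[-train_idx], predict(rf, df[-train_idx, ]))
```

Applying the same split to `forest_data` (column names here are illustrative) would give an out-of-sample RMSE directly comparable to the cross-validated figure above.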

Moreover, the model identifies the top 10 most important variables in predicting average housing price. We notice that the most important variables are related to transportation: WALK_BIKE, TRANSperc, and TRANSIT. In addition, other variables such as Mfperc and WORK_AT_HOME also show significance. We should note that the importance scores are derived from node impurity during tree construction, with higher scores meaning a greater improvement in node purity.
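Node-purity importance can favor variables with many potential split points; permutation importance (%IncMSE) is a common alternative. A hedged sketch on synthetic data (variable names here are illustrative, not the report's):

```r
# Sketch: permutation importance via importance = TRUE in randomForest.
# y depends strongly on `strong` and not at all on `weak`.
library(randomForest)
set.seed(123)
n  <- 300
df <- data.frame(strong = rnorm(n), weak = rnorm(n))
df$y <- 3 * df$strong + rnorm(n)

rf  <- randomForest(y ~ ., data = df, ntree = 300, importance = TRUE)
imp <- importance(rf, type = 1)  # type = 1 -> mean decrease in accuracy (%IncMSE)
```

Refitting `optimal_rf_model` with `importance = TRUE` would allow the same comparison for the housing predictors, partly addressing the node-purity limitation noted later.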

Conclusions & Policy Recommendations

Determinants of Housing Price

The variable importance plot provides insights into the factors that play a crucial role in determining housing prices. The variable ‘WALK_BIKE’ holds the greatest significance, suggesting that amenities promoting walking and biking have a major influence on housing values. This could be indicative of a rising awareness of, and preference for, convenience and well-being among Chicago residents. Additionally, ‘TRANSperc’ and ‘TRANSIT’ are also identified as important variables, emphasizing the role of transportation infrastructure in evaluating housing prices. This aligns with the idea that easy access to transport can boost property appeal and subsequently drive up housing prices. Land-use demographics matter as well: the multi-family and single-family land-use shares rank quite high in importance, indicating that the mix of multi-family and single-family homes in an area may reflect broader trends in housing demand and urban development patterns. The variable ‘INC_GT_150’ highlights how economic prosperity affects housing prices, indicating that neighborhoods with more high-income households are associated with expensive real estate. Moreover, the significance of the ‘VACANT’ variable underscores how supply impacts pricing dynamics, as areas with many vacant homes may experience different market conditions compared to those with fewer vacancies.

Housing Price Clustering Analysis

Given these findings from the table and example clustering plot, a recommendation for stakeholders and policymakers would be to focus on differentiated strategies for each cluster. For Cluster 2, enhancing existing infrastructure and local services could further increase property values, maintaining the high housing prices in these community areas. In Clusters 1 and 3 (with noticeably lower average housing prices), there is an opportunity to develop housing policies that support affordability while also considering interventions to stabilize and increase property values, such as community development projects or incentivizing new local businesses to address the diverse needs of these communities.

Potential Limitations

  • Because we analyze the Chicago housing market at the level of community areas, we implicitly assume that properties in the same area have similar housing prices, which could lead to inaccurate interpretations. This overlooks the diversity in property types, conditions, and micro-locations within the same community area; the real estate market often varies significantly even within small geographic areas.
  • There is a risk of over-fitting due to the small number of data points; the very high accuracy of the random forest model exemplifies this concern.
  • We cannot determine the exact impact of each price determinant because the random forest model ranks variables by node purity rather than estimating coefficients. This makes it challenging to understand the exact influence of each variable on housing prices.

Appendix

The table below shows detailed descriptions of the variables:

# Import variables description metadata. 
Variables_Description_Metadata <- read_excel("Dataset/Variables Description Metadata.xlsx")

# Create a table with caption for variables description. 
table <- kable(Variables_Description_Metadata, format = "html", caption = "Variable Definitions: Author's creation using CMAP dataset")
pretty_table <- table %>%
                kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                              full_width = F)

# Appendix table
pretty_table
Variable Definitions: Author’s creation using CMAP dataset
Variables Names Description
GEOG Location (name of the community area)
TOT_POP Total population of the CA
POP_HH Total number of households
A20_34 Counts of residents within the CA aged 20 to 34.
A35_49 Counts of residents within the CA aged 35 to 49.
A50_64 Counts of residents within the CA aged 50 to 64.
MED_AGE The median age of the residents in CA.
WHITE The number of residents in CA that have racial category as White, Non-Hispanic
HISP The number of residents in CA that have racial category as Hispanic or Latino of any race.
BLACK The number of residents in CA that have racial category as Black, Non-Hispanic
ASIAN The number of residents in CA that have racial category as Asian, Non-Hispanic
OTHER The number of residents in CA that have racial category as Other/Multiple Races, Non-Hispanic
POP_16OV No definition found; variable removed during analysis
IN_LBFRC Counts of residents within the CA that are categorized as in the labor force
EMP Counts of residents within the CA that are categorized as in the labor force and employed
UNEMP Counts of residents within the CA that are categorized as in the labor force and unemployed
NOT_IN_LBFRC Counts of residents within the CA that are categorized as not in the labor force
WORK_AT_HOME Counts of residents within the CA that work at home rather than travel to work
TOT_COMM Total number of residents within the CA that commute to work
DROVE_AL Counts of residents within the CA that drive alone to work
CARPOOL Counts of residents within the CA that carpool to work
TRANSIT Counts of residents within the CA that use public transportation to work
WALK_BIKE Counts of residents within the CA that use a bike or walk to work
COMM_OTHER Counts of residents within the CA that use other mode of travel to work
POP_25OV No definition found; variable removed during analysis
HS Counts of residents within the CA that have a high school diploma or equivalent
BACH Counts of residents within the CA that have a bachelor’s degree
INC_LT_25K Counts of households within the CA that have an income less than $25,000
INC_25_50K Counts of households within the CA that have an income between $25,000 and $50,000
INC_50_75K Counts of households within the CA that have an income between $50,000 and $75,000
INC_75_100K Counts of households within the CA that have an income between $75,000 and $100,000
INC_100_150K Counts of households within the CA that have an income between $100,000 and $150,000
INC_GT_150 Counts of households within the CA that have an income more than $150,000
MEDINC The median household income within the CA
TOT_HH Total number of Occupied Housing Units
OWN_OCC_HU Total number of Owner Occupied Housing Units
RENT_OCC_HU Total number of Renter Occupied Units
VAC_HU Total number of Vacant Housing Units
HU_TOT Total number of Housing Units
HU_SNG_DET Counts of Housing Units for a single Family, detached
HU_SNG_ATT Counts of Housing Units for a single Family, attached
HU_2UN Counts of housing units in 2-unit structures
HU_3_4UN Counts of housing units in 3-to-4-unit structures
HA_AFT2000 Counts of housing units that are built after 2000
HA_70_00 Counts of housing units that are built between 1970 and 2000
HA_40_70 Counts of housing units that are built between 1940 and 1970
HA_BEF1940 Counts of housing units that are built before 1940
MED_HA Median year built of housing units within the CA
BR_0_1 Counts of housing units that have 0 to 1 bedrooms
BR_2 Counts of housing units that have 2 bedrooms
BR_3 Counts of housing units that have 3 bedrooms
BR_4 Counts of housing units that have 4 bedrooms
BR_5 Counts of housing units that have 5 bedrooms
MED_ROOMS Median number of rooms within a housing unit in the CA
HV_LT_150K No definition found; variable removed during analysis
HV_150_300K No definition found; variable removed during analysis
HV_300_500K No definition found; variable removed during analysis
HV_GT_500K No definition found; variable removed during analysis
MED_HV No definition found; variable removed during analysis
TOT_EMP_RES No definition found; variable removed during analysis
RES_NAICS1_TYPE The largest employer sector of CA residents
RES_NAICS2_TYPE The 2nd largest employer sector of CA residents
RES_NAICS3_TYPE The 3rd largest employer sector of CA residents
RES_NAICS4_TYPE The 4th largest employer sector of CA residents
RES_NAICS5_TYPE The 5th largest employer sector of CA residents
RES_NAICS1_COUNT Number of employment of the largest employer sector of CA residents
RES_NAICS2_COUNT Number of employment of the 2nd largest employer sector of CA residents
RES_NAICS3_COUNT Number of employment of the 3rd largest employer sector of CA residents
RES_NAICS4_COUNT Number of employment of the 4th largest employer sector of CA residents
RES_NAICS5_COUNT Number of employment of the 5th largest employer sector of CA residents
TOT_EMP_WORK Total number of employment of top 5 industry sector with largest employed population from CA
WORK_NAICS1_TYPE The largest employer sector in CA boundary
WORK_NAICS2_TYPE The 2nd largest employer sector in CA boundary
WORK_NAICS3_TYPE The 3rd largest employer sector in CA boundary
WORK_NAICS4_TYPE The 4th largest employer sector in CA boundary
WORK_NAICS5_TYPE The 5th largest employer sector in CA boundary
WORK_NAICS1_COUNT Number of employment of the largest employer sector in CA boundary
WORK_NAICS2_COUNT Number of employment of the 2nd largest employer sector in CA boundary
WORK_NAICS3_COUNT Number of employment of the 3rd largest employer sector in CA boundary
WORK_NAICS4_COUNT Number of employment of the 4th largest employer sector in CA boundary
WORK_NAICS5_COUNT Number of employment of the 5th largest employer sector in CA boundary
RES_CITY1_TYPE Top 1 employment locations of CA residents
RES_CITY2_TYPE Top 2 employment locations of CA residents
RES_CITY3_TYPE Top 3 employment locations of CA residents
RES_CITY4_TYPE Top 4 employment locations of CA residents
RES_CITY5_TYPE Top 5 employment locations of CA residents
RES_CITY1_COUNT Number of employment of top 1 employment locations of CA residents
RES_CITY2_COUNT Number of employment of top 2 employment locations of CA residents
RES_CITY3_COUNT Number of employment of top 3 employment locations of CA residents
RES_CITY4_COUNT Number of employment of top 4 employment locations of CA residents
RES_CITY5_COUNT Number of employment of top 5 employment locations of CA residents
WORK_CITY1_TYPE Location with most employed people in the CA
WORK_CITY2_TYPE Location with 2nd most employed people in the CA
WORK_CITY3_TYPE Location with 3rd most employed people in the CA
WORK_CITY4_TYPE Location with 4th most employed people in the CA
WORK_CITY5_TYPE Location with 5th most employed people in the CA
WORK_CITY1_COUNT Number of employment of location with most employed people in the CA
WORK_CITY2_COUNT Number of employment of location with 2nd most employed people in the CA
WORK_CITY3_COUNT Number of employment of location with 3rd most employed people in the CA
WORK_CITY4_COUNT Number of employment of location with 4th most employed people in the CA
WORK_CITY5_COUNT Number of employment of location with 5th most employed people in the CA
AVG_VMT Average Vehicle Miles Traveled per Year of household in CA
RET_SALES No definition found; variable removed during analysis
RES_EAV No definition found; variable removed during analysis
CMRCL_EAV No definition found; variable removed during analysis
IND_EAV No definition found; variable removed during analysis
RAIL_EAV No definition found; variable removed during analysis
FARM_EAV No definition found; variable removed during analysis
MIN_EAV No definition found; variable removed during analysis
TOT_EAV No definition found; variable removed during analysis
OPEN_SPACE_PER_1000 Accessible Park Acreage per 1,000 Residents in the CA
TOT_ACRES Total Acres of land in the CA
SF Total acres of land in the CA for single family residential
Sfperc Percent of total acres of land in the CA for single family residential
MF Total acres of land in the CA for multi-family residential
Mfperc Percent of total acres of land in the CA for multi-family residential
MIX Total acres of land in the CA for mixed use purpose
MIXperc Percent of total acres of land in the CA for mixed use purpose
COMM Total acres of land in the CA for commercial purpose
COMMperc Percent of total acres of land in the CA for commercial purpose
INST Total acres of land in the CA for institutional purpose
INSTperc Percent of total acres of land in the CA for institutional purpose
IND Total acres of land in the CA for industrial purpose
INDperc Percent of total acres of land in the CA for industrial purpose
TRANS Total acres of land in the CA for transportation and Other
TRANSperc Percent of total acres of land in the CA for transportation and Other
AG Total acres of land in the CA for agriculture purpose
Agperc Percent of total acres of land in the CA for agriculture purpose
OPEN Total acres of land in the CA that are open space
OPENperc Percent of total acres of land in the CA that are open space
VACANT Total acres of land in the CA that are vacant
VACperc Percent of total acres of land in the CA that are vacant
Year The year of the dataset